ABSTRACT


To characterize economic development and diagnose economic health, several popular indices such as gross
domestic product (GDP), industrial structure and income growth are widely applied. However, computing these indices
from a traditional economic census is usually costly and resource-consuming and, more importantly, involves a long time delay.
In this paper, we analyze the activities of nearly 200 million users over four consecutive years in the largest social network
in China (Sina Microblog), aiming to explore latent relationships between online social activity and local economic status.
Results indicate that online social activity is strongly correlated with local economic development and industrial structure
and, more interestingly, allows the macro-economic structure to be revealed almost instantaneously at nearly no cost. In
addition, this work provides a new avenue for identifying risk signals in local economic structure.


Introduction 

With the fall of the Iron Curtain and the upheaval in Eastern Europe, the international environment has changed profoundly,
and security concerns have gradually come to be dominated by economic rather than military considerations. In fact, economic
status not only drives the development of national defense but also, more directly, affects our lives, from personal investment
to government policy-making. For instance, it is crucial for a mayor to consider the local economic structure of the
city whenever he or she makes economy-related policies. For ordinary people, economic development usually
brings a considerable improvement in living standards. Meanwhile, our money-oriented plans and decisions
depend, explicitly or implicitly, on the local or national economic status. How to provide an efficient and comprehensive view
of the health of the economy is thus of great importance for governments as well as individuals. Currently, the economic census
provides a straightforward way to present a clear picture of many facets of the national or global economy through
indices such as Gross Domestic Product (GDP), industrial structure and income growth. However, computing these indices
is a non-trivial task, as it consumes considerable resources over a long time. For example, the widely
accepted index, GDP, is defined as “an aggregate measure of production equal to the sum of the gross values added of all
resident institutional units engaged in production”. Calculating such an index requires collecting data from distinct local
governments and then integrating all the data. Two problems arise during this process. First, the
economic census requires a large quantity of manual labour, materials and other related resources. Second, and more importantly,
the procedure is time-consuming, so many economy-oriented decisions cannot be made in time. Although some new
census techniques, such as sampling, have been developed in recent decades to speed up the process, they often
suffer in accuracy because the statistics are derived from partial rather than complete data. In light of these problems,
a few indirect indices have been introduced in recent years to reflect the economic status quickly. A famous one, the Keqiang
Index, consists of three indicators (railway cargo volume, electricity consumption and loans disbursed by banks) and was
proposed to measure the economy of China. Although the index reflects economic development, it is insufficient
to provide a comprehensive overview of the economy, as it relies heavily on industry and largely ignores agriculture
and services. The Producer Price Index (PPI) is another popular index, which measures the average change in the
prices received by domestic producers for their output. This measure captures economic effects on people’s daily
life, providing a potential hint of the stability of the economy and society. Although there also exist more complicated
indices, such as the Consumer Price Index (CPI), Social Retail Goods (SRG) and Foreign Direct Investment (FDI), to characterize
economic development or economic structure, obtaining them is a non-trivial task owing to their long-term data collection and
calculation procedures. Therefore, a fast, effective and comprehensive strategy that brings deep insight into the economic status
is highly desired.


Table 1. Statistics of the Sina Microblog user data. N, No and Nl denote the total number of registered users, the number of
ordinary users, and the number of ordinary users with location information (including prefecture-level city), respectively,
for each year from 2009 to 2012.

User          2009          2010          2011          2012
N        2,025,595    38,263,550    82,934,212    75,242,922
No       1,841,346    35,370,466    79,462,793    73,718,770
Nl         900,632    15,144,018    37,563,752    32,898,514


Social networks, such as Facebook, Twitter and Sina Microblog, are becoming the primary venues for people to obtain
and share information on a global scale. Without doubt, this new social media is mainly driven by advances in
information technology. However, many studies have also demonstrated that national economic status and policies play
an important role in the growth rate, diversity and stability of social networks. For instance, Katona et al. found
that the economy has significant effects on the structure of the World Wide Web. Ioannides et al. studied individual outcomes
in a dynamic environment in the presence of social interactions; their results show that the topology of social interactions
changes over time once individual outcomes vary continuously. Besides economists, sociologists also point out
that social networks and economic networks are mutually interrelated. Indeed, social networks permeate our social
and economic lives, play a central role in the transmission of information about job opportunities, and are critical to the
trade of many goods and services. Eagle et al. showed that regional communication diversity is strongly correlated
with economic development. Furthermore, a related study reported high correlations in many cases, revealing the
diversity of socio-economic insights that can be inferred using only mobile phone call data. Moreover, Bettencourt et al.
have shown strong relationships among several social and economic indices. In addition, some companies have attempted
to use properties of online networks to reflect economic status, e.g. the Taobao CPI. Motivated by these studies, in this paper
we concentrate on analyzing and quantifying the latent relationship between economic status and online social activity, and
propose a simple yet effective method to analyze the macro-economic structure in a data mining framework. Building upon this
data-driven analysis, the work has several interesting findings: (i) online social activity shows a strong linear correlation
with economic development; (ii) the macro-economic structure is well reflected by online social activity; (iii) online social
activity allows the economic status to be analyzed almost instantaneously and thus supports timely decisions, from individuals
to countries.

Materials and Methods 

Data Acquisition and Description 

Here, to investigate the latent relationships between social activity and economic status, we focus on the primary social
network in China, Sina Microblog (SM); the economic data are derived from the National Statistic Bureau of the People’s
Republic of China.

Sina Microblog. The data were collected from Sina Microblog (www.weibo.com), the leading social network in
China, launched in 2009 by Sina Corporation. As on Twitter, approximately 100 million messages are posted on the platform
each day. Here we collect nearly 200 million registered users from 2009 to 2012. For each
microblogger, the registration date, verification information and location information (province and prefecture-level city)
are examined. More than 97% of microbloggers are ordinary users (as opposed to verified users). The basic statistics of the data
set are summarized in Table 1.

National Economic Data. The national economic data were collected from the official “China City
Statistical Yearbook”, published each year by the National Statistic Bureau (NSB) of the People’s Republic of China.
The yearbook reports major economic and social indices, such as total population, resident
population, GDP, average GDP and industrial structure, to mention a few. Because of the time-consuming data collection and
calculation procedures, the yearbook cannot be published in the same year and usually appears with a delay of about one year.
From the yearbooks, we collected and integrated the total population, resident population, GDP, average GDP and industrial
structure of 282 prefecture-level cities in China from 2008 to 2012 (see Supplemental Information for the reason for choosing
prefecture-level cities and for the list of city names). The distributions of the number of registered users in Sina
Microblog and of GDP in the 282 prefecture-level cities are presented in Figure 1.

Correlation Analysis

Here, we use two measures to explore the relationship between online social activity and economic status in the above
two data sets.
Figure 1. The distributions of registered users in Sina Microblog (left; unit: 10 thousand users) and of GDP (right; unit:
1 billion RMB) in the 282 prefecture-level cities of China in 2012.


The Pearson correlation coefficient measures the linear correlation between two random variables $X = \{x_1, \dots, x_n\}$
and $Y = \{y_1, \dots, y_n\}$ (n is the dimension of X and Y), yielding a value ranging from -1 (completely negative
correlation) to 1 (completely positive correlation) (see also the non-trivial bounds of the Pearson correlation coefficient
for heterogeneous systems). Formally, the Pearson correlation coefficient r is defined as:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (1) $$

Here $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

Spearman’s rank correlation coefficient is a nonparametric measure of dependence between two variables. A Spearman
correlation of 1 or -1 is obtained when each variable is a perfect monotone function of the other and there are no repeated
values in the two variables. For the calculation, both variables, X and Y, are converted to ranks, $\tilde{X}$ and
$\tilde{Y}$; the ith element of $\tilde{X}$, $\tilde{x}_i$, is the rank of $x_i$ among all elements of X. Spearman’s rank
correlation coefficient ρ is then defined as

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \qquad (2) $$

where $d_i = \tilde{x}_i - \tilde{y}_i$ is the difference between ranks.
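As a minimal sketch of the two correlation measures described above, the following plain-Python functions compute the Pearson coefficient and the rank-difference form of Spearman's coefficient. The function names and the toy numbers are hypothetical and are not taken from the paper's data; the Spearman function assumes there are no repeated values, as the rank-difference formula requires.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank of each element (1 = smallest); assumes no repeated values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman's rho from squared rank differences; valid only without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical toy data: registered users and GDP for five cities.
users = [12.0, 30.0, 45.0, 80.0, 200.0]
gdp = [50.0, 90.0, 160.0, 300.0, 1000.0]
print(pearson(users, gdp), spearman(users, gdp))
```

Because the toy GDP values increase monotonically with the user counts, the Spearman coefficient is exactly 1 here, while the Pearson coefficient is below 1 because the relationship is not exactly linear.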


Predicting Economic Structure 

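Here, we introduce support vector regression to uncover the macro-economic structure from online social activity.

Support Vector Regression (SVR). The objective of SVR is to find a function f(x) that deviates by at most ε from the
true target $y_i$ for all the training data and is, at the same time, as flat as possible:

$$ f(x) = \langle w, x \rangle + b, \quad w \in \mathcal{X},\ b \in \mathbb{R}, \qquad (3) $$

$$ \text{minimize} \quad \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon \\ \langle w, x_i \rangle + b - y_i \le \varepsilon \end{cases} \qquad (4) $$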
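To make the two requirements concrete (stay within ε of every target, and be as flat as possible), the sketch below evaluates candidate linear models on toy data. It assumes the standard flatness objective ½‖w‖²; the function names, the data and the two candidate models are hypothetical and only illustrate the constraints, not how an SVR solver finds the optimum.

```python
def predict(w, b, x):
    """Linear model f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def flatness(w):
    """Objective 0.5 * ||w||^2 (assumed standard form); smaller is flatter."""
    return 0.5 * sum(wi * wi for wi in w)

def within_tube(w, b, X, y, eps):
    """True if every training point deviates from f by at most eps."""
    return all(abs(yi - predict(w, b, xi)) <= eps for xi, yi in zip(X, y))

# Hypothetical one-dimensional training data.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 0.9, 2.1, 2.9]

steep = ([1.0], 0.0)    # candidate that follows the data closely
flat = ([0.9], 0.15)    # flatter candidate

for w, b in (steep, flat):
    print(w, b, flatness(w), within_tube(w, b, X, y, eps=0.2))
```

Both candidates satisfy the ε = 0.2 constraints, so the formulation prefers the second one, whose weight vector has the smaller norm.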
Figure 2. Relationships between online social activity (i.e. the number of online registered users in Sina
Microblog) and economic indices from 2009 to 2012. Here P, RP, AG, GDP and UN represent population,
resident population, average gross domestic product, gross domestic product and the number of registered users in Sina
Microblog, respectively.


To enable SVR to handle nonlinear practical problems, the kernel trick is usually applied. The basic idea is to map the
data into a high-dimensional feature space via a mapping function Φ and to perform linear regression in the new space.
Several kernel functions $K(x_i, x_j)$ are widely used: (i) linear kernel: $K(x_i, x_j) = x_i \cdot x_j$; (ii) polynomial
kernel: $K(x_i, x_j) = (x_i \cdot x_j + c)^d$, where d > 0; (iii) Gaussian radial basis function (RBF):
$K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, where γ > 0; (iv) hyperbolic tangent:
$K(x_i, x_j) = \tanh(k\, x_i \cdot x_j + c)$. In this study, the Gaussian radial basis function is applied. In addition, to
reduce the variability of the results and to evaluate how they generalize to an independent data set, we apply a
leave-one-out cross-validation procedure.
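The four kernels listed above can be written down directly. The sketch below is illustrative only: the default parameter values (c, d, gamma, kappa) are hypothetical choices, since the source does not report the values it used, and the RBF kernel is assumed to take the common form exp(-γ‖u - v‖²).

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, c=1.0, d=2):
    """(u . v + c)^d with degree d > 0."""
    return (dot(u, v) + c) ** d

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF exp(-gamma * ||u - v||^2) with gamma > 0 (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def tanh_kernel(u, v, kappa=1.0, c=0.0):
    """Hyperbolic tangent kernel tanh(kappa * u . v + c)."""
    return math.tanh(kappa * dot(u, v) + c)

# Hypothetical feature vectors.
u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v), tanh_kernel(u, v))
```

The RBF kernel depends only on the distance between the two points, equals 1 for identical inputs and decays towards 0 as the points move apart, which is what lets the regression fit local nonlinear structure.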

Evaluation. To evaluate the prediction performance, we use two metrics: the root mean square error (RMSE) and
the relative error (RE). For both metrics, the smaller the value, the better the performance.

RMSE is a popular metric that captures the difference between predicted and true values. Formally, it is defined as:

$$ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}, \qquad (5) $$

where $y_i$ and $\hat{y}_i$ are the true and predicted values, respectively, and n is the size of the testing set. Given a true
value $y_i$ and its predicted value $\hat{y}_i$, if $y_i \neq 0$, RE is defined as

$$ \mathrm{RE} = \frac{\lvert y_i - \hat{y}_i \rvert}{\lvert y_i \rvert}. \qquad (6) $$
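A small sketch of the two evaluation metrics follows. It assumes that RE is the absolute deviation divided by the magnitude of the true value, and that a single RE figure per experiment is obtained by averaging over the test samples; the averaging step and all names are assumptions made for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over a test set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def relative_error(y_true, y_pred):
    """|y - y_hat| / |y| for one sample; undefined when the true value is 0."""
    if y_true == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return abs(y_true - y_pred) / abs(y_true)

def mean_relative_error(y_true, y_pred):
    """Average RE over the test set (the averaging is an assumption)."""
    errors = [relative_error(t, p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Hypothetical true and predicted log10(GDP) values.
truth = [9.0, 10.0, 11.0]
guess = [9.2, 9.9, 11.1]
print(rmse(truth, guess), mean_relative_error(truth, guess))
```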
Figure 3. The correlation between online social activity and macro-economic structure. (A) The scaling of the number
of registered users with GDP from 2009 to 2012. The number at the top of each column is the year. In each
panel, the solid line is the least-squares fitting line, and the dashed line is parallel to the fitting line. (B)
Histograms of the three-sector ratios of selected outlier cities in different years, as marked in (A). Here $r_1$, $r_2$
and $r_3$ are the percentage shares of the primary, secondary and tertiary sectors, respectively.


Results 

Relationship between Online Social Activity and Economic Status 

The correlation analysis shows that the number of online registered users is highly related to economic
indices such as population, resident population, average gross domestic product and gross domestic product (see Table 2 and
Figure 2). Concretely, the number of registered users (UN) of Sina Microblog is positively correlated with population
(P), with Pearson coefficients ranging from 0.31 to 0.34 across years. This is in line with the intuition that more
people use Sina Microblog where the local population is larger. Similarly, UN is positively correlated with resident
population (RP), but with a higher value, because RP reflects the number of inhabitants more accurately. In addition to the
population-related indices, UN also shows a strong correlation with the average gross domestic product (AG), suggesting that
people in richer cities are more likely to use new social media. Most strikingly, GDP has a surprisingly high correlation with
UN (e.g., r ≈ 0.88 and ρ ≈ 0.90 in 2009), indicating that city-level online social activity might be mainly determined by two
factors: individual wealth and population size.

Table 2. The correlation between UN and economic indices including P, RP, AG and GDP. r and ρ denote the Pearson
correlation coefficient and Spearman’s rank correlation coefficient, respectively. In each row, the highest value for each
coefficient is marked with an asterisk.

        Pearson Correlation (r)                   Spearman’s Rank Correlation (ρ)
Year    P-UN     RP-UN    AG-UN    GDP-UN         P-UN     RP-UN    AG-UN    GDP-UN
2009    0.3235   0.5574   0.5080   0.8780*        0.5928   0.6836   0.5237   0.8988*
2010    0.3117   0.5249   0.4661   0.8487*        0.5995   0.7130   0.4906   0.8950*
2011    0.3199   0.6025   0.4305   0.8560*        0.5844   0.7130   0.4673   0.8709*
2012    0.3390   0.6193   0.4269   0.8632*        0.5944   0.7233   0.4427   0.8638*


Uncovering Macro Economic Structure via Online Social Activity 

Table 3. The RMSE of the three methods for predicting the distribution of GDP among the three sectors (primary, secondary
and tertiary industries). In each row, the lowest (best) value for each sector is marked with an asterisk.

        Primary                          Secondary                        Tertiary
Year    Doo       Ecoindex  Random      Doo       Ecoindex  Random      Doo       Ecoindex  Random
2009    0.0846    0.0845*   0.4603      0.1038*   0.1080    0.3045      0.0800*   0.0830    0.3378
2010    0.0815*   0.0817    0.4615      0.1016*   0.1064    0.3087      0.0815*   0.0850    0.3329
2011    0.0805*   0.0810    0.4587      0.0999*   0.1042    0.3021      0.0830*   0.0866    0.3337
2012    0.0800*   0.0807    0.4600      0.0985*   0.1029    0.3002      0.0855*   0.0881    0.3392


Table 4. The performance of GDP prediction in terms of RMSE and RE using different information. P-G: using the
population of the previous year to predict the GDP of the following year. Similarly, RP-G, AG-G and Integrated-G denote the
cases using the previous year’s resident population, average GDP, and the combination of population, resident population and
average GDP, respectively. UN-G: using the number of registered users to predict the GDP of the same year. All variables are
expressed in logarithmic form. In each row, the lowest (best) RMSE and RE are marked with an asterisk.

        P-G               RP-G              AG-G              Integrated-G      UN-G
Year    RMSE     RE       RMSE     RE       RMSE     RE       RMSE     RE       RMSE      RE
2009    0.3212   0.0358   0.2980   0.0332   0.3161   0.0365   0.2916   0.0338   0.1786*   0.0203*
2010    0.3174   0.0349   0.2980   0.0330   0.3148   0.0357   0.2889   0.0329   0.1822*   0.0203*
2011    0.3133   0.0344   0.2865   0.0317   0.3138   0.0352   0.2911   0.0329   0.1952*   0.0216*
2012    0.3062   0.0332   0.2728   0.0299   0.3147   0.0351   0.2952   0.0331   0.1951*   0.0213*


As demonstrated above, online social activity is strongly related to local economic status. Beyond that, online
social activity also reflects the city-level macro-economic structure well. For each year, Figure 3(A) marks the names of the
cities that lie far from the least-squares fitting line. Looking more closely at these cities, we found that
the points below the fitting line (with a larger number of registered users and a relatively lower GDP) tend to be
service-driven cities. For instance, Lhasa, Sanya, Haikou and Xi’an are the four cities with the largest distance below the
fitting line in 2009. The common salient feature of all these cities is a tourism boom. In contrast, the cities above the
fitting line, such as Zhongwei, Laibin, Chongzuo and Ordos in 2009, focus on heavy industry. In addition, these outliers
are almost consistent across the years from 2009 to 2012. Figure 3(B) further plots the macro-economic structure (the
distribution over the three sectors: primary, secondary and tertiary) of these cities in different years. To test the
hypothesis that the deviation from the fitting line reflects the economic structure, we introduce a measure of the distance
between the offline economic output (i.e. GDP) and online social activity (i.e. UN), with the fitting lines in Figure 3(A)
as reference. Formally, Doo quantifies the deviation from the fitting line as:

$$ \mathrm{Doo} = l_i - y_i, \qquad (7) $$

where i is the label of the target city, $y_i = \log_{10} \mathrm{GDP}_i$ is the GDP of city i in logarithmic coordinates,
and $l_i$ is the value on the fitting line, in logarithmic coordinates, for the number of registered users of city i.
Figure 4 depicts the correlations between Doo and the three sectors in different years. Doo shows a positive linear
correlation with the primary and tertiary industries and a negative linear correlation with the secondary industry.
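The deviation Doo can be sketched in a few lines: fit a least-squares line in log-log coordinates and subtract each city's actual log GDP from the value on the line. The sketch assumes that the line regresses log10(GDP) on log10(number of registered users); the city values are invented for illustration and are not the paper's data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def doo(users, gdp):
    """Deviation l_i - y_i of each city from the log-log fitting line."""
    log_un = [math.log10(u) for u in users]
    log_gdp = [math.log10(g) for g in gdp]
    slope, intercept = fit_line(log_un, log_gdp)
    return [(slope * x + intercept) - y for x, y in zip(log_un, log_gdp)]

# Hypothetical cities: registered users and GDP (arbitrary units).
users = [1e4, 1e5, 1e5, 1e6]
gdp = [1e9, 1e10, 4e9, 1e11]
deviations = doo(users, gdp)
print([round(d, 3) for d in deviations])
```

In this toy example the third city has as many users as the second but a much smaller GDP, so it falls below the fitting line and receives a positive Doo, the pattern the text associates with service-driven cities.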

Therefore, we expect the proposed index to be able to predict the macro-economic structure. To evaluate the
prediction performance objectively, three strategies were applied, based on support vector regression (see Materials and
Methods): (a) Doo: using the single index Doo to predict the distribution of GDP among the three sectors; (b) Ecoindex:
using the four traditional economic indices P, RP, AG and GDP for prediction; (c) Random: randomly generating a value between
the minimal and maximal values in the training set as the prediction. Table 3 summarizes the prediction performance of
the three strategies. Interestingly, Doo alone reflects the macro-economic structure well: its RMSE is comparable to or lower
than that of Ecoindex and far below that of Random.
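The Random strategy is simple enough to sketch directly. The version below assumes that the value is drawn uniformly between the training minimum and maximum, which the text does not state explicitly; the function name, the seed parameter and the sector shares are hypothetical.

```python
import random

def random_baseline(train_values, n_predictions, seed=None):
    """Draw each prediction uniformly between the training min and max."""
    rng = random.Random(seed)  # local generator, so a seed gives repeatable draws
    low, high = min(train_values), max(train_values)
    return [rng.uniform(low, high) for _ in range(n_predictions)]

# Hypothetical tertiary-sector shares of the training cities.
train_shares = [0.28, 0.35, 0.41, 0.52, 0.33]
print(random_baseline(train_shares, n_predictions=3, seed=0))
```

Such a baseline ignores the input features entirely, so any informative predictor should achieve a clearly lower RMSE.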

GDP Prediction 

Considering the time- and resource-consuming procedure of GDP calculation in the traditional economic census, we use UN to
predict GDP, building upon their strong correlation. Here we apply the highly related economic indices (population,
resident population and average GDP) and online social activity (i.e. the number of registered users in Sina Microblog) to
predict GDP, respectively. Table 4 gives the prediction accuracy based on the different information, suggesting that online
social activity allows the most effective GDP prediction.
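The leave-one-out evaluation behind such a comparison can be illustrated as follows. Note that this sketch swaps in an ordinary least-squares line for the support vector regression used in the study, purely to keep the example dependency-free; the data are hypothetical log10 values, not the paper's.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept (stand-in for SVR)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loo_predictions(xs, ys):
    """Predict each y_i from a model fitted on all the other points."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds

# Hypothetical log10(registered users) and log10(GDP) for five cities.
log_un = [4.0, 4.5, 5.0, 5.5, 6.0]
log_gdp = [9.1, 9.4, 10.1, 10.4, 11.0]

preds = loo_predictions(log_un, log_gdp)
rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(log_gdp, preds)) / len(preds))
mean_re = sum(abs(y - p) / abs(y) for y, p in zip(log_gdp, preds)) / len(preds)
print(round(rmse, 4), round(mean_re, 4))
```

Each city is predicted by a model that never saw it, so the resulting RMSE and RE estimate how well the relationship generalizes to an unseen city.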

Discussion 

Figure 4. The relationships between Doo and economic sectors. Doo represents the deviation between the fitting line and
the true GDP in logarithmic form. Primary (%), Secondary (%) and Tertiary (%) represent the percentage shares of the
primary, secondary and tertiary sectors in total GDP, respectively. The relationships in different years, from
2009 to 2012, are plotted in different columns.


As economic prosperity results from the aggregation of human activities, economic status could be correlated
with other human-related and human-activated systems, such as the structure of commercial webs, energy consumption,
regional communication diversity and city size. Nowadays, online social networks, such as Facebook, Twitter and Sina
Microblog, have been permeating every aspect of our social and economic lives. Therefore, we need to take a close look at the
relationship between online social activity and economic status. In this paper, for the first time, we quantitatively explore
the potential relationship between the major economic indices of 282 prefecture-level cities of China and the activities of
$2 \times 10^8$ users of China’s largest social network (i.e., Sina Microblog). Empirical results show that the economic
indices and the number of registered users in Sina Microblog are closely correlated (see Table 2 and Figure 2). Statistically
speaking, people in more developed areas use online social media more frequently.

The uncovered strong correlation further allows GDP to be forecast effectively and efficiently via support vector
regression (see Table 4). Compared with the highly resource-consuming methods of the traditional economic census, our method
has three remarkable properties. First, owing to the strong correlation between online social activity and economic status, it
allows accurate prediction. Second, online social activity can be collected at any time, so we can analyze the
economic status almost instantaneously and support timely decisions. Finally, collecting online social activity has almost
no cost compared with the national economic census.

More interestingly, online social activity can reflect the macro-economic structure of cities and thus reveal
outliers, some of which may have an unhealthy economic structure and be fragile against changes in the external economic
environment. For instance, Ordos, located in Inner Mongolia, is one of the richest regions of China; its nominal per-capita
GDP once ranked ahead of Beijing’s. This city holds 1/6 of the national coal reserves, 1/3 of the national natural gas reserves
and 1/2 of the national kaolin deposits. Accordingly, coal mining, petrochemicals and the production of building materials
became the pillars of its economy. In its golden times, Ordos was not aware of the latent risk embedded in its poorly
diversified economic structure, which is highly dependent on the prices of coal and gas. Housing prices once rose to an
unimaginably high level; recently, triggered by the drop in coal and gas prices, housing prices in Ordos also dropped very
rapidly, leading to an overall economic collapse. Chongzuo, located in southwestern Guangxi, is also rich in minerals,
including manganese, gold and coal. It is China’s biggest manganese producer and the world’s biggest producer of bentonite.
In addition, its pillar industry is sugar refining. With the decline of the sugar industry, Chongzuo, like Ordos, is exposed
to adverse conditions through a process of economic restructuring. Laibin and Zhongwei have suffered similarly because of
their over-dependence on mineral prices. All four of these cities were detected by our simple analysis (see Figure 3). On the
other hand, the economies of cities like Sanya and Haikou (see also Figure 3) rely heavily on local services, which are very
sensitive to tourism. Although many previous studies have shown factors correlated with economic structure, such as
environmental quality, warfare and political regime, our study provides a simple, quick, yet effective way to predict the
local macro-economic structure (see Table 3) and, more importantly, gives a hint about the health of the local economic
structure.

In summary, by coupling population-level online social activity with economic outcomes in prefecture-level cities, we
were able to conclude that population-level activity on emerging online social media is closely related to local
economic status. Although a causal relation cannot be established yet, online social activity seems to be a highly sensitive
barometer of economic conditions. This interesting correlation further suggests a new avenue for observing the health of
the economy, which may provide novel insights in addition to the statistics of the traditional economic census.